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Abstract In-silico investigation of skin permeation is 
an important but also computationally demanding prob¬ 
lem. To resolve all scales involved in full detail will 
not only require exascale computing capacities but also 
suitable parallel algorithms. This article investigates 
the applicability of the time-parallel Parareal algorithm 
to a brick and mortar setup, a precursory problem to 
skin permeation. The C++ library Lib4PrM implement¬ 
ing Parareal is combined with the UG4 simulation frame¬ 
work, which provides the spatial discretization and par¬ 
allelization. The combination’s performance is studied 
with respect to convergence and speedup. It is con¬ 
firmed that anisotropies in the domain and jumps in dif¬ 
fusion coefficients only have a minor impact on Parareal’s 
convergence. The influence of load imbalances in time 
due to differences in number of iterations required by 
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the spatial solver as well as spatio-temporal weak scal¬ 
ing is discussed. 
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1 Introduction 

Permeation of chemical substances through human skin 
is an interesting and important process e.g. for the de¬ 
velopment of cosmetics or drugs. In-vitro studies with 
humans constitute the “gold standard” but they are ex¬ 
pensive and limited by ethical and practical concerns. 
Here, in-silico studies are a viable alternative. They 
have been successfully used in the past (cf. the reviews 
in mnu) and can be expected to become even more im¬ 
portant in the future. They allow for hypothesis testing 
and may lead to experiments through which effects not 
known today could be discovered. 

Yet, numerical simulations in this field are demand¬ 
ing in terms of computational resources. The problem 
covers vastly different physical scales and, in case a 
complex full-fledged model is used, massive computa¬ 
tional parallelism needs to be exploited to reach reason¬ 
able times-to-solutions. Therefore, many interesting as¬ 
pects such as substructures of lipid bilayers or networks 
of keratin filaments mm have not yet been investi¬ 
gated in three spatial dimensions (3D) using numerical 
simulations. Finally, modern imaging techniques make 
resolving a spectral range of a few nanometers possible, 
which results in “big data” for analyses. Understanding 
the functional mechanism of the skin is thus a candidate 
from the life sciences for applying exascale computing. 

Considering the technology trend towards more and 
more parallelism, the application of new parallel meth¬ 
ods to the problem investigated here becomes relevant. 
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Promising candidates for such methods are parallel-in¬ 
time integration methods that can add a direction of 
concurrency in addition to spatial parallelization, e.g., 
as used here, parallel multi-grid. In recent years, time- 
parallel methods have matured from a mainly math¬ 
ematical concept to an approach with demonstrated 
efficiency in massively parallel computations m- 
They have been listed as a direction for mathematical 
research with the potential to help reaching exascale 
computing El- 

One of the most widely investigated parallel-in-time 
methods is Parareal, introduced in 2001 by Lions, Ma- 
day and Turinici [19] . It has been used for benchmark 
problems motivated by applications from fields as di¬ 
verse as plasma physics [32], computational fluid dy¬ 
namics mm or quantum chemistry [7j. Improvements 
with respect to implementation are considered e.g. in 
|MT2] . Parareal’s most appreciated aspect is probably 
that it is non-intrusive and rather easy to integrate 
into existing codes. Its drawback, on the other hand, 
is a quite severe bound on achievable parallel efficiency. 
However, because several other “across-the-step” time- 
parallel methods share similar features with Parareal 
(e.g. PITA png, MGRIT [H] or PFASST [13]), study¬ 
ing Parareal’s performance often already gives impor¬ 
tant insights. 

Theoretical estimates for stability and convergence 
of Parareal for linear diffusive problems with constant 
coefficients can be found in [Tfi] , Theory for diffusive 
problems with constant coefficients can also be found 
in [6]. For 2D diffusion with space-time dependent coef¬ 
ficients, numerical experiments showed only a marginal 
reduction in convergence speed |31j . In [id], the small 
impact of a time dependent viscosity on Parareal’s con¬ 
vergence for an advection-diffusion problem is demon¬ 
strated. However, performance for a 3D diffusive prob¬ 
lem on a complex geometry with anisotropies has not 
yet been studied. 

In preparation for the eventual application of Para¬ 
real to skin permeation, this article provides an inves¬ 
tigation of Parareal’s performance for a 3D brick and 
mortar problem. From our point of view, this model 
serves as an excellent benchmark because it features 
challenges resulting from a complex anisotropic geom¬ 
etry and from jumping coefficients, which need to be 
resolved adaptively over long time intervals. Although 
locally the mathematical formulation of the brick and 
mortar problem is clear, the global picture is highly 
complex and linked to a real world application requiring 
a sound simulation infrastructure in terms of numerical 
methods and software. Here, we employ the simulation 
framework UG4 [33] for the spatial discretization and lin¬ 
ear solvers, for which excellent parallel scaling has been 


demonstrated [26]. We parallelize UG4’s serial tempo¬ 
ral solvers through the C++ Parareal library Lib4PrM, 
which is integrated as a plugin. 

The present article establishes the principle appli¬ 
cability of Parareal to the 3D brick and mortar prob¬ 
lem. In doing so, it identifies a set of relevant issues 
to be tackled in order to develop an improved parallel- 
in-time integrator that can deliver reasonable efficiency 
for the skin transport problem. In particular, load bal¬ 
ancing in time is identified as a critical issue when 
combining implicit methods for a complex PDE with 
Parareal. Because the number of iterations of the spa¬ 
tial solver varies in time, balancing temporal subinter¬ 
vals in Parareal simply by the number of time steps 
induces load imbalances which can affect speedup. 


2 Problem and methods 

As a test case, we study a simplified version of the 3D 
brick and mortar problem introduced earlier in mM- 
The benchmark is defined on a biphasic domain fl C R 3 
that consists of two disjoint subsets f? cor , f?iip C R 3 
representing the so called corneocyte and lipid phase 
respectively. To be more specific, Q is the interior of 
the union i7 cor U !?ii p of closures. On this geometry we 
solve a diffusion equation with space-dependent diffu¬ 
sion coefficients. The evolution of the drug concentra¬ 
tion c p (x, £), p £ {cor,lip}, is modeled by the equation 


d t Cp(x,t) = V ■ (D p (x)Vcp(x, t)) 


( 1 ) 


with x £ D p , t £ [0, T], and a phase-dependent diffusion 
coefficient 


d pW 


D c 

D 


lip 


. X £ I2 CO r, 

: x £ i? lip . 


( 2 ) 


\ 2 

For the simulation time we use T = — which is 

6D off 

the characteristic lag time of the problem. It is de¬ 
fined in terms of the membrane thickness A and the 
effective (homogenized) diffusion coefficient D c g [23j . 
In terms of boundary conditions we have an interior 
phase boundary r = I2 cor n Dy lp and an exterior bound¬ 
ary dQ. For the exterior boundary we consider a mix 
of Dirichlet and homogeneous Neumann conditions, i.e. 
<9i7 = <9L?d U 9I7n with 


c p( x ^)l 9 r 2 D = ff( x ), n • Vc p ( x , t)| ar?N = 0, (3) 

where n denotes the outward pointing normal vector 
on dD. At the phase boundaries T, the flux must be 
continuous, i.e. 

AipVcii p (x, t) ■ ii — D cor Vc CO r(x, t) • n. (4) 
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(a) with cornoecytes (yellow) (b) corneocytes removed 


Fig. 1 A sketch of the 3D brick and mortar problem is shown. The geometry has ten layers of corneocytes (h? cor , yellow) that 
are embedded in a matrix of lipid bilayers (!?ii p , blue). 


Along r, concentrations can be discontinuous. How¬ 
ever, they are often assumed to be linked by a Nernst’s 
equilibrium A" cor /] ip Cii p = c CO r- When A cor /] ip is con¬ 
stant, the model may be reformulated, e.g., in terms 
of a continuous concentration 122122!- Hence this work 
employs the simplifying assumption ci; p = c cor . Note 
that more complex situations with locally fluctuating 
or even concentration dependent coefficients also play 
an important role n- 

The geometry used in this article is a 3D brick and 
mortar configuration as depicted in Figure[I] It features 
ten layers of corneocytes (yellow) that are embedded in 
a matrix of lipid bilayers (blue). It is a simplified ver¬ 
sion of more elaborate tetrakaidekahedral models with 
hybrid grids presented previously [53]; the brick and 
mortar model only consist of hexahedra with a reduced 
level of anisotropy. However, since jumping coefficients 
are present, it features some of the issues one encoun¬ 
ters also in more complex situations. 

Figure [2] shows the computed solution at three dif¬ 
ferent times. To allow for a view into the interior of the 
domain, a cuboid representing a quarter of the total do¬ 
main has been removed in the representation. Initially 
(not shown), the solution is zero on the whole domain 
with a Dirichlet boundary condition of c p (x,i) = 1 on 
the top side. Then, the tracer starts to diffuse down¬ 
wards in the lipid channels and much slower in com¬ 
parison through the corneocytes. In the first subfigure, 
diffusion has just started and filled the upper half of the 
domain but the concentration is still essentially zero in 
the lower half. In the last subfigure, at the end of the 
simulation, the tracer has diffused down through the 
whole domain. Concentrations continue to change with 
smaller changes from step to step, eventually approach¬ 
ing a steady state. A prospective 3D model with a fully 
resolved lipid-bilayer substructure, as suggested for two 


dimensions in n, will feature a similar effect on an 
even smaller scale in time and space. 

2.1 Parareal 

Let the temporal domain [0, T\ be decomposed into N t 
subintervals [t 7 -, A+i], j = 0,..., N t — 1, such that t 0 = 
0, tjv t —x = T and 

[0, fr] U ... U T] = [0, T\. (5) 

Let C and J be a “coarse” and “fine” time integra¬ 
tion methoc0 with time step size At and St -C At, 
respectively. For the sake of simplicity assume that all 
subintervals have the same length and that a constant 
number of both St and At steps cover a subinterval ex¬ 
actly. Then, instead of serially integrating across [0,T] 
with T, Parareal uses the iteration 

c*X\ = e(c k n +1 )-e(c k n ) + ?(c k n ) ( 6 ) 

with k as iteration index. For the first subinterval, i.e. 
for [0,ii], set 

4 = c 0 (7) 

for all k, where Co is the given initial value. Note that 
as the iteration converges and c k +\ — c() +1 approaches 
zero for all n = 0,..., N t — 1, the Parareal solution 
converges to the serial fine solution T(c n ). 

Formulation © introduces concurrency because as 
soon as the iterates are known, the computationally 
expensive evaluation of the fine method can be done in 
parallel over all subintervals. That is, the time spent 
using the fine method in parallel equals the runtime of 
the fine method across a single subinterval instead of 

1 The coarse method is often represented by S, probably 
because of the French word “gros” for coarse. 
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(a) Time t = T /i 6 . 


70.75 

0.5 

0.25 


(b) Time t = T !i. 



(c) Time t = T. 


Fig. 2 Solution at t = T /i 6 (top), t = T /2 (middle), and 
t = T (bottom) where T denotes the lag-time of the problem. 
The block in the front has been removed to allow for a view 
into the interior of the computational domain. 


the full interval [0,F]. However, multiple iterations are 
typically required and the propagation of corrections 
by the coarse method remains serial in time. Speedup 
therefore depends on finding a cheap enough coarse in¬ 
tegrator that still leads to convergence in a small num¬ 
ber of iterations. 


2.2 Speedup from Parareal 


Denote by N c the number of coarse time steps per 
subinterval, by Nf the number of fine time steps per 
subinterval and by N t the number of subintervals, which 
is assumed to be equal to the number of processors in 
the temporal direction. Further, denote by N\ the num¬ 
ber of Parareal iterations and by r c and r f the com¬ 
putational runtime for a coarse or fine time step, re¬ 
spectively. If one assumes that every time step takes 
the same amount of time, speedup from Parareal can 
be modeled by 


S(N t ) < 


1 

T NcT*_ TV;' 

^ + N t ) N { T f + N t 


( 8 ) 


See e.g. J2U] or [5] for a more detailed discussion of this 
bound. 

For larger end times for the brick and mortar setup, 
however, this bound is too optimistic, because as the so¬ 
lution approaches a steady-state, the spatial solver re¬ 
quires fewer and fewer iterations per time step, making 
later time steps cheaper. In the numerical simulations 
presented below, we use a final time T for which the 
solution is still sufficiently far away from the steady- 
state and this effect is minimal, but we also discuss a 
formula valid for non-constant runtimes per time step. 
Here, to simplify the notation, we omit the index range 
for sums, maxima and minima; it is always implied to 
be n = 0,..., N t — 1. Now, denote by 7 ^ and 7 ^ the 
cost of running the coarse and fine method across the 
subinterval [t n ,t n+ 1 ]. Then, a serial run of the fine or 
coarse method amounts to the duration 

r ' = = (9) 

while a Parareal run with TV; iterations costs 


Fp = r c + iVi 7 x , 7 X = max { 7 = + 7 ^ } , ( 10 ) 

n K 


as the runtime for the parallel fine solve will be dom¬ 
inated by the subinterval with the longest simulation 
time. Also, using proper pipelining, the parallel run¬ 
time of the serial coarse correction step will be governed 
by the most expensive subinterval for C. The resulting 
estimate for the speedup of Parareal is then 


S(N t ) 


F f 

F P 


1 


f c 7 X 

— + Ni — 
F f Ff 


( 11 ) 


This is a slight generalization of (|sj) in the sense that if 
7 ° = N c t c and 7 ^ = N{T { are constant for all subinter¬ 
vals, we get 


F c = N t N c r c , 


F f = N t N f T f , 


( 12 ) 
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and 


7 X = N c t c + N f r { 


(13) 


for which (111 simplifies to ([ 8 ]). According to (11), in the 


case of imbalances in the distribution of computational 
load across subintervals, possible speedup is limited by 
the subinterval with the longest runtime for both the 
fine and coarse method. 

The optimal configuration therefore corresponds to 
equal computing times for all subintervals. For explicit 
schemes, where the cost per time step is more or less 
constant, this balance is relatively easy to achieve by 
making sure every subinterval handles the same num¬ 
ber of coarse and fine steps, resulting in the simple 
speedup model For implicit methods, however, cost 
per time step is mainly determined by the cost of the 
spatial solver (typically depending mainly on the num¬ 
ber of iterations), which in turn depends on the a priori 
unknown dynamics of the solution. Therefore, naively 
load balancing Parareal with implicit methods based 
on the number of time steps alone can lead to a sig¬ 
nificant loss in efficiency. Unfortunately, it seems that 
devising a proper load balancing in time for the im¬ 
plicit case is not straightforward and, to the best of 
our knowledge, has not yet been addressed in the lit¬ 
erature. The easiest approach may be to use the infor¬ 
mation from the initial coarse run in Parareal to de¬ 
termine the size of the subintervals but this requires 
a non-negligible amount of implementation, might in¬ 
hibit proper pipelining when requiring synchronization 
at some point and is thus left for future work. 

To illustrate the effect of load imbalances in time 
on speedup from Parareal, Figure [3] visualizes both the 
projected speedup from <© and ( fll| ): the ideal 
sumes a constant coarse-to-fine ratio of 


case as- 


Nf 7n 10' 


(14) 


The imbalanced case artificially increases 73 = (1 + 6 ) x 
10 and reduces 72 = (1 — 6 ) x 10 while keeping all other 
7 )) and all 7 ® unchanged. Here, 6 is an artificial pa¬ 
rameter modeling load imbalance between the second 
and third slice in the formula for projected speedup. 
For 6 = 0, both slices have the same load (’’ideal case”) 
while increasing 6 corresponds to an increasing imbal¬ 
ance in load: for 6 = 1 , the second slice no longer does 
any work while the third slice does twice as much work 
as in the ideal case. Note that the sum if, that is the to¬ 
tal workload, remains constant. Clearly, the introduced 
imbalance has a noticeable detrimental effect on the 
projected speedup. 



Fig. 3 Projected speedup for the ideally load balanced case 
and a case where the second and third slice are imbalanced 
by a factor of b = 0.5 or 6 = 0.75. 


2.3 Weak scaling 

When doubling the spatial resolution, the resulting in¬ 
crease in the computational cost per time step can be 
compensated for by a corresponding increase in the 
number of cores used for the spatial parallelization - 
at least if both the employed method and implementa¬ 
tion show good weak scaling. For the spatial solver and 
parallelization of UG4 applied to the benchmark used 
here, this has been demonstrated successfully in [34] . 
However, when the number of fine time steps Nf is also 
doubled, twice as many time steps have to be computed 
in order to keep the ratio St/5x constant, which leads to 
a doubling of time-to-solution (see also the discussion 
in bd- Time parallelization can provide some mitiga¬ 
tion by also doubling the number of subintervals and 
thus of cores used to parallelize along time - unless this 
leads to a massive increase in the number of required 
iterations. 

However, because of the serial coarse correction, just 
as Parareal cannot achieve ideal strong scaling, it can 
also not provide 100% efficiency in weak scaling. To see 
this, let h denote a parameter governing accuracy of the 
discretization. Estimated runtime for a Parareal run on 
N t cores (in time) is then 

Rzh(N t ) = N t N cT % h + Nf (. N c r c 2h + Nf4 h ) . (15) 

Doubling spatial resolution, i.e. using h instead of 2h 
and performing twice as many fine and coarse steps on 
twice as many subintervals (but keeping the number of 
coarse and fine steps per subinterval constant) gives 

R h (2Af t ) = 2 N t N c r c h + Nf ( N c r c h + N { r f h ) (16) 

assuming the number of iterations does not change. In 
the case of perfect weak scaling in space, by increasing 
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the number of spatial cores the runtime per step can 
be kept constant, that is r 2h = r£ and r? 2h = r ! h . Under 
this assumption, the projected weak scaling efficiency 
of the space-time parallelization is 

R2h{N t ) _ N t a + Nj (1 + a) 

Rh(2N t ) ~ 2N t a + N i (l + a) [ J 

with a := N c t^/N{T^ > 0. Therefore, while weak scal¬ 
ing can never be perfect, a cheap enough coarse integra¬ 
tor (it< 1 ) should still allow for good weak scaling - as 
long as the underlying spatial solver shows good weak 
scaling and the number of iterations is not affected by 
the increasing number of subintervals. 


2.5 Spatio-temporal discretization and solvers 

Discretization in space and time is provided by the soft¬ 
ware package UG4 [35]. We employ a plain vanilla vertex 
centered finite volume scheme in space that is combined 
with an implicit Euler scheme in time. For each time 
step this gives rise to a large linear system of equations, 
where the number of degrees of freedom corresponds to 
the number of vertices of the grid. The solver is a multi¬ 
grid method with three steps of damped (w = 0.6) Ja¬ 
cobi relaxation used for pre- and post-smoothing. More 
details are provided in [31]. The coarse grid problem 
with 7‘581 degrees of freedom was solved using sequen¬ 
tial SuperLU mm- 


2.4 Implementation 

For Parareal we use the C++ library Lib4PrM that uses 
MPI for the necessary communication of volume data in 
time. A straightforward approach to implementing <§ 
is sketched as pseudo code e.g. in [3]. Here, however, we 
use a somewhat more elaborate implementation that 
is based on the following observation. After the first 
iteration on the first subinterval [0, t\], the coarse terms 
in the Parareal iteration § cancel out, resulting in 

c\ +1 = T (c 0 ) (18) 

for k > 1. That is, after one iteration, the first subin¬ 
terval is guaranteed to have converged and the proces¬ 
sors responsible for the first subinterval can “retire”. 
After the second iteration, by the same argument, this 
will then be true for the processors handling the sec¬ 
ond subinterval and so on. After k iterations, the time- 
parallel fine method is guaranteed to have converged on 
the first k subintervals and all processors with an MPI 
rank (in time) smaller or equal to k could in principle 
be used otherwise. Put differently, Parareal converges 
always at least at a rate of one subinterval per iteration 
and when k = n the Parareal method is guaranteed 
to have converged at tk < t n . While leaving processors 
idle according to this implementation does not affect 
runtime negatively, it has the potential to reduce the 
energy cost of a simulation, particularly in combina¬ 
tion with “dynamic voltage and frequency scaling” m- 
This will be studied in a future work m- Also, if not 
enough processors are available to cover the whole in¬ 
terval [0, T] by subintervals of a given size, converged 
processors could pick up subintervals at the end in a 
caterpillar-like way. Such even more involved implemen¬ 
tations are left for future studies, though. Finally, in 
production runs one could also use some tolerance e.g. 
for the updates between iterations or a proper residual 
to decide when the solution at the end of a subinterval 
is converged [25] , 


3 Numerical results 

We report results from solving the brick and mortar 
problem described in ([2] with the simulation framework 
described above. For both C and 2f we use an implicit 
Euler method with the time step size At for C being 
significantly larger than the time step size St for 2r. 

All runs are performed on the Cray XC40 Piz Dora 
supercomputer at the Swiss National Supercomputing 
Centre (CSCS) in Lugano, Switzerland. This supercom¬ 
puter is equipped with 1‘256 compute nodes, each of 
which consists of two 12-core Intel Xeon E5-2690v3 
CPUs, making for a total of 30T44 compute cores j^] Its 
peak performance is 1.254 PFlops, placing it at position 
56 in the Top500 November, 2014 listj^j As compiler we 
used version 4.9.2 of the the GNU compiler collectior]/] 
and, in the following, report runtimes of simulations 
without I/O operations. 


3.1 Convergence of Parareal 


Generally speaking, convergence of Parareal is affected 
by a number of parameters: The time step sizes and 
methods used for C and T, the number of concurrently 
computed subintervals, the discretization used for the 
spatial derivatives, and the dynamics of the problem to 
be solved. To measure convergence, below the relative 
defect between Parareal after k iterations and the 
fine solution run in serial is reported, i.e. 


d k = 


I c k — c 


n = 0, 


,N t - 1 


(19) 


with 


c n = T(c„_i), n = 1,..., N t — 1. (20) 

2 http://user.cscs.ch/computing_systems/piz_dora/ 

3 http://www.top500.org/list/2014/11 

4 https://gcc.gnu.org 
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Fig. 4 The defect dfj _ 1 between the Parareal and the serial 
fine solution at t = T versus the number of Parareal iterations 
for different numbers of subintervals Nt is illustrated. The 
discretization error of the serial fine method is indicated by 
the black horizontal line. 



Fig. 5 The defect at t = T between the Parareal 

Nt = 32 solution and the fine serial solution versus the num¬ 
ber of Parareal iterations for the brick and mortar problem 
(red) and a reference configuration with constant diffusion 
coefficients (blue) is shown. 


In order to avoid distortions through I/O times, only 
the final solution values are written out and the defect 
rfjv t _i is reported except for in section (3.3 There, the 
defect is analyzed not only as a function of the number 
of iterations but also time. 

Figure [5] shows the defect d k Nt _ x versus the number 
of iterations k. In addition, the estimated discretization 
error of the fine method resulting from a comparison 
against a run of T with a time step four times smaller 
than St is shown. Parareal converges rapidly in all con¬ 
figurations. As can be expected, computing more subin¬ 
tervals in parallel slows down convergence somewhat. 
Here, for all N t G {4, 8,16, 32}, one iteration suffices to 
reduce the defect below the discretization error of the 
fine method. This confirms the usability of Parareal also 
for complex diffusion problems with anisotropic geome¬ 
tries and large jumps in the coefficients. Particularly 
the relatively mild reduction of convergence speed as 
Nt is increased illustrates the potential for using larger 
numbers of cores to parallelize in time for this kind of 
problem, should a sufficiently large machine be avail¬ 
able. 


3.2 Effect of spatially varying coefficients 

In the brick and mortar problem, the diffusion coef¬ 
ficients jump between D\ lp = 1 in the lipid channels 
and D cor = 10 -3 in the corneocytes. To assess the im¬ 
pact these jumps have on the convergence of Parareal, 
Figure [5] gives a comparison of the defect for the brick 
and mortar problem (red) and a reference configuration 
with D\ lp = Dcor = 10 -3 throughout the whole domain. 
For the setup studied here, in line with the findings for 


2D problems in El], the jump in coefficients has almost 
no effect on how Parareal convergence. Experiments not 
documented here suggest that a larger T (that is, a fi¬ 
nal configuration closer to the steady state) can lead 
to a larger detrimental effect of coefficient jumps: How¬ 
ever, even there this only resulted in a small number of 
additional iterations required for convergence. 


3.3 Error over time 

So far only the defect at the end of the simulation has 
been reported. In contrast, Figure [6] shows both the de¬ 
fect of Parareal for k = 1 (red) and k = 2 (green) as 
well as the estimated discretization error of the coarse 
(blue) and fine (yellow) integrator. The figure shows the 
defect after every second subinterval for Parareal using 
A} = 32 . 

Already after one iteration, the solution at T = 1 
provided by Parareal has the same quality as when run¬ 
ning the fine method serially. Therefore, speedup is re¬ 
ported below using k = 1. However, one iteration is not 
sufficient to reduce the defect of the whole transient to 
the discretization error: here, two iterations would be 
required after which the green line (Parareal) is com¬ 
pletely below the yellow line (fine integrator). 


3.4 Scaling of Parareal 

Figure [7] shows the speedup from Parareal compared to 
running 'J in serial with the number of iterations chosen 
such that the defect of Parareal at T = 1 is below the es¬ 
timated discretization error of T (for N t € {4,8,16,32} 
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Fig. 6 The discretization error over time for the fine and 
coarse solution, and the defect of Parareal with N t = 32 
subintervals after k = 1 and k = 2 iterations is illustrated. 



Cores 

Fig. 7 Depicted is the speedup of the time parallelization 
with Parareal. 


3.5 Spatio-temporal weak scaling 

Figure [ 8 ] shows convergence of Parareal runs for four 
different setups with increasing spatial and temporal 
resolution but keeping both the number of elements in 
space and time steps per core constant. The first one 
(blue) uses a time step size of At = 1 /s and St = 1 /i 2 s 
in units of T. The second one (red) on the other hand 
uses half the coarse and fine time step and half the 
spatial mesh width so that St/dx and At/Sx are the 
same in both runs. It also uses twice as many cores in 
time and eight times more cores in space, so that both 
the number of elements per core and the number of 
time steps per core remain constant, too. The green and 
yellow line then correspond to runs that again double 
spatial and temporal resolution. Higher spatio-temporal 
resolution leads to smaller defects for Parareal, while 
the rates of convergence (i.e. the slopes of the lines) 
remain roughly the same. 

Exact configurations are shown in Table [Tj note how 
At~ l /N t and St -1 /N t as well as # Elements/P spaC e 
are constant in all configurations. Also, each refine¬ 
ment step halves the estimated fine discretization error, 
which matches the behavior expected for the employed 
first-order discretization. The number of Parareal iter¬ 
ations required for convergence to the accuracy of 'J 
stays constant: every configuration is converged after 
one iteration. Runtimes are increasing as the problem 
size grows, so space-time weak scaling is not optimal. 
Partly, this is because of the overhead from the coarse 
method, see the discussion above, partly because of less 
than optimal weak scaling of the spatial solver. Never¬ 
theless, Parareal helps to mitigate some of the increase 
in runtime from increasing the spatio-temporal problem 
size. 


this means one iteration). The projected speedups for 
ideal load balancing according to ([ 8 ]) are marked by blue 
circles while the projected speedups according to ( 11 ), 
including differences in runtime between subintervals, 
are shown as red squares. Here, runtimes per subinter¬ 
val yjj and 7 ^ are measured experimentally from serial 
runs of C and J. Measured speedups are shown as green 
diamonds. Up to sixteen subintervals, speedup follows 
the projected value reasonably well, but for 32 subin¬ 
tervals noticeable drop-off is observed - in small parts, 
this is due to imperfect load balancing as indicated by 
the difference between the red and blue curve. The ma¬ 
jor part, however, is overhead from communication and 
other factors, which are not incorporated in the speedup 
model. 


4 Conclusions 

Computational modeling of skin permeation is of inter¬ 
est for different applications. However, the full problem 
requires resolution of a vast range of scales, leading to 
enormous computational requirements. Massively par¬ 
allel computers are needed but these require suitable 
parallel numerical methods to be used efficiently. 

This article investigates the applicability and per¬ 
formance of the time-parallel Parareal integrator to a 
relevant precursory problem of skin transport, namely 
a 3D brick and mortar configuration. For this, the C++ 
Parareal library Lib4PrM is integrated as a plug-in into 
the simulation environment UG4 using implicit integra¬ 
tors and a geometric multi-grid as spatial solver. While 
the brick and mortar problem does not yet feature the 
same geometric level of detail as the skin transport 










Numerical simulation of skin transport using Parareal 


9 


Run 

At- 1 

st- 1 

# Elements 

P 

x space 

At 

P botal 

^fine 

d N t 

R [s] 

Factor 

(2t,3 a ) 

8 

128 

277‘440 

3 

2 

6 

lQ-2.9 

lQ-2.8 

132.79 

- 

(4t, 24 s ) 

16 

256 

2‘219‘520 

24 

4 

96 

10“ 3 ' 2 

IQ — 3 -2 

239.34 

1.80 

(8 t ,192 s ) 

32 

512 

17‘756‘160 

192 

8 

1‘536 

10“ 3 ' 5 

10-S.7 

316.83 

1.32 

(16t, l‘536 s ) 

64 

1‘024 

142‘049‘280 

1‘536 

16 

24‘576 

10" 3 ' 8 

IQ- 4 - 3 

508.70 

1.61 


Table 1 Configuration of the runs shown in Figure | 8 | Both the number of coarse and fine time steps per core in time and 
the number of elements per core in space are kept constant in all runs. Here, efi ne indicates the estimated discretization error 
of the fine integrator and djy is the defect after one iteration. R indicates runtime in seconds. Note that Nt is the number of 
subintervals and equal to the number of cores in time. The last column gives the factor between runtimes: ideal space-time 
weak scaling corresponds to a factor of 1 . 0 , ideal spatial weak scaling with no time parallelization corresponds to a factor of 
two while no weak scaling at all would lead to a factor of 8 x 2 = 16 because the simulations use a 3D spatial discretization. 



Fig. 8 The convergence of Parareal for four different con¬ 
figurations is shown. The ratios 5t/Sx and At/5x are kept 
constant and the number of degrees-of-freedom per core as 
well as the number of time steps per subinterval are kept 
constant, too. Thus, e.g. in the run with 8 cores in time and 
192 cores in space the spatio-temporal resolution is twice as 
good as in the run with 4 cores in time and 24 cores in space. 
Note that because the problem is in 3D, doubling the spatial 
resolution requires eight times more cores. 

problem, it already has jumps in the diffusion coeffi¬ 
cients of several orders of magnitude on a highly anis¬ 
otropic domain. The article is an extension of a previous 
study of a 2D problem on a domain with a much simpler 
structure [ f5T| . 

Performance of the space-time parallel solver is stud¬ 
ied in several numerical experiments. It is confirmed 
that Parareal still converges quickly for the brick and 
mortar case. Moreover, strong and weak scaling as well 
as implications for the “trap of weak scaling” are illus¬ 
trated and discussed. 

As the solution of the brick and mortar problem 
approaches a steady state in time, initial guesses for 
the geometric multi-grid become more accurate if time 
steps of constant length are used. This leads to faster 
convergence of the geometric multigrid, which in turn 
induces an imbalance in workload between the different 
processors in time. For the chosen setup, this effect is 


small but by deriving a simple theoretical model, im¬ 
balances in time are shown to have a potentially sig¬ 
nificant effect on parallel efficiency. For Parareal with 
implicit integrators applied to complex PDEs, the re¬ 
sulting load imbalance is an important issue that has 
to be addressed. 

A number of possible directions for future research 
emerge from the experiments presented here. So far, 
coarsening in Parareal was done only in time by us¬ 
ing a larger time step. Better results can be expected 
if the spatial discretization is coarsened simultaneously. 
This requires a closer integration of Parareal with the 
spatial multi-grid solver, in order to provide interpola¬ 
tion and restriction routines. Also, this approach can 
be taken further by interweaving iterations of the time- 
parallel method with iterations of the spatial solver, 
as discussed e.g. for Parareal in ;22] or for PFASST 
in m■ Another important issue that is also connected 
to load balancing is spatial and temporal adaptivity. 
While both can in principle be used in Parareal, they 
greatly complicate the load balancing problem. Finally, 
as energy consumption is becoming a more and more 
important issue in high-performance computing, a thor¬ 
ough benchmarking in terms of energy-to-solution is 
also an important direction for future work. 
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